1. SUMMARY


The project titled “Deep Reading and Learning” explored several algorithms and approaches
for knowledge base population from natural language texts. The central problem addressed is to
extract and infer factual event data from natural texts in a form that can be asserted into a
knowledge base. Building on core natural language processing (NLP) technology from Stanford and
elsewhere, we developed new algorithms and state-of-the-art software for many NLP subtasks, from
lower-level tasks such as part-of-speech tagging to higher-level tasks such as script learning. We
published our work in conferences such as ICML, AAAI, EMNLP, and ACL and in journals such as JAIR
and JMLR.

Our project takes to heart the view that understanding text consists of extracting facts and
representing them in a formal language ready to be added to a knowledge base. Given the various
kinds of ambiguity in natural texts and our incomplete understanding of the grammatical structure,
semantics, and pragmatics of natural languages, this is a daunting task. Nevertheless, we made
significant progress on several NLP subtasks, including part-of-speech tagging, chunking, named
entity recognition, co-reference resolution, linking, event detection, event-argument extraction,
and script learning. The key technology that enabled our success is our HC-Search algorithm, which
is based on search-based structured prediction. Almost all tasks in NLP can be viewed as mapping a
structured input, e.g., a sentence or a document, into a structured output, e.g., a graph or a
knowledge base. The problem of learning this mapping from supervised training data is called
structured prediction. In search-based structured prediction, this mapping is constructed
incrementally via search. HC-Search in particular formulates the problem as learning a cost
function C and a heuristic function H such that the correct output has the least cost under C and
is reached by a search algorithm guided by H. Significant contributions of our project include the
following.

1. In an early paper at AAAI 2013, which received an outstanding paper award, we showed
the generality and effectiveness of the HC-Search framework on a number of tasks, including
part-of-speech tagging and chunking, obtaining state-of-the-art results.

2.  We  advanced  the  state  of  the  art  in  co-reference  resolution  using  a  pruning  enhancement 
of  search-based  structured  prediction. 

3. We formulated within-document and cross-document coreference problems as non-convex
optimization and solved them using a Majorization-Minimization algorithm.

4. We developed an approach to detect multi-word event nuggets using a novel forward-backward
recurrent neural network architecture, with state-of-the-art results.

5.  We  developed  a  new  approach  for  script  learning  based  on  Hidden  Markov  Models. 

6. We developed a new multi-task structured prediction framework and evaluated it on several
NLP tasks such as named entity recognition, co-reference resolution, and entity linking.

7.  We  participated  in  several  TAC  competitions  including  the  last  one  in  2016  on  Tri-lingual 
Entity  Discovery  and  Linking. 
2. INTRODUCTION


The goal of this project was to contribute to the next generation of software tools needed for
deep understanding of natural language texts. Over the years, the natural language community has
built an impressive array of tools that are routinely used by researchers and developers. Our own
work leveraged and built upon a variety of widely available NLP tools, most importantly Stanford’s
CoreNLP toolkit (Manning et al., 2014). Despite the availability of many tools, research and
software in NLP are not yet at a stage where practitioners can use them to extract knowledge from
texts and populate a knowledge base. The goal of our work was to develop new algorithms and
software that push the state of the art in higher-level language processing tasks such as
co-reference resolution, event detection, and script learning towards building formal meaning
representations that can be queried.

Early work in natural language processing emphasized deep comprehension and underscored the
need for commonsense world knowledge to understand text in context (Wilks and Charniak, 1976;
Schank and Abelson, 1977). In more recent work, however, the emphasis shifted to learning-based
approaches that exploit large amounts of data to learn parameters for solving relatively
lower-level tasks such as part-of-speech tagging, shallow parsing, word sense disambiguation, and
semantic role labeling. This focus was driven both by the empirical success of statistical
learning methods and by the challenges of formalizing and reasoning with large amounts of world
knowledge. Our project falls squarely in the empirical paradigm, but it is also inspired by, and
contributes to, learning higher-level knowledge in the form of event scripts, and it explores
computational frameworks that combine learning and search and can be employed in multiple NLP and
non-NLP tasks.

Many tasks in natural language processing can be formulated as structured prediction, which
transforms a structured input to a structured output using a mapping function learned from
training data. Examples include detecting noun-phrase mentions in a document, identifying
co-reference relationships between mentions, linking them to entities in the knowledge base,
detecting events in the document, identifying their types and arguments, and so on. Importantly,
the learning system does not produce a single label as in a typical classification application
such as face recognition, but needs to construct a coherent structured output based on structured
input. In general, the task involves making many small decisions to produce a structured output
that is globally coherent and consistent with the input, which is itself structured, noisy, and
ambiguous.


3.  METHODS,  ASSUMPTIONS  AND  PROCEDURES 

Our general approach is in the framework of search-based structured prediction, which employs
search algorithms to construct a suitable output that optimizes a global coherence score. In
addition to the coherence score, the search algorithms require heuristics to guide the search. We
developed a search-based framework called HC-Search that employs a combination of a heuristic and
a scoring function in the context of limited discrepancy search and achieved state-of-the-art
results in a number of domains, including part-of-speech tagging and chunking (Doppa et al. 2013).
We later extended this work with a pruning heuristic under the name Prune-and-Score and applied it
to within-document co-reference resolution with state-of-the-art results (Ma et al. 2014). We also
studied cross-document and within-document co-reference resolution in the Easy-First search-based
structured prediction framework, with state-of-the-art results (Xie et al. 2015). The novelty here
is the formulation based on convex-concave constrained programming (CCCP), which can be solved by
a majorization-minimization approach.

A second class of problems we addressed relates to detecting events in texts and learning
patterns among them. We developed a new neural architecture based on recurrent neural networks to
detect event nuggets that span multiple words (Ghaeini et al. 2016). Our architecture employs a
novel recurrent neural network that processes the text in both forward and backward directions
around the potential nugget words. We also investigated a novel algorithm for learning models of
scripts, or stereotypical event sequences, in the form of Hidden Markov Models using an EM-style
algorithm (Orr et al. 2014). The novelty here is to appropriately account for missing
observations, which are common in most natural language texts. Our approach was the first use of
HMMs for representing and learning scripts, and it improved upon several baselines on a benchmark
dataset.

In more recent work, we developed a new multitask structured prediction framework and applied
it to simultaneously solve multiple NLP tasks, including named entity recognition, co-reference
resolution, and entity linking. The key idea is to cycle through the different structured
prediction tasks one after another until they all converge to a locally optimal solution. This
takes advantage of the relative independence between tasks to speed up the search while also
exploiting their mutual constraints to improve the global coherence of the solution (Ma et al.
2017).

In addition to this research, we also participated in several TAC competitions on entity
detection and linking and on event-argument extraction, culminating in the trilingual entity
detection and linking task in 2016.


4.  RESULTS  AND  DISCUSSION 

In  this  section,  we  detail  our  different  research  contributions  and  the  results  on  multiple  problems 
addressed  in  the  project. 

4.1 Search-based Structured Prediction

As noted earlier, many tasks in natural language processing, from part-of-speech tagging to entity
linking, can be formulated as structured prediction, or transforming structured inputs to
structured outputs (Daume et al. 2009). Our version of the search-based approach to structured
prediction, called HC-Search, involves first defining a combinatorial search space over complete
structured outputs that allows for traversal of the output space (Doppa et al. 2012). Next, given
a structured input, say a sequence of natural language words, a state-based search strategy (e.g.,
best-first or greedy search) is employed to explore the space of possible outputs, e.g., sequences
of part-of-speech tags, for a specified time bound. The least-cost output uncovered by the search
according to a learned cost function C is then returned as the prediction.
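
The prediction procedure above can be sketched as follows. This is a minimal illustration under
stated assumptions, not the project’s implementation: the search space changes one label at a
time, greedy search stands in for the search strategy, a step budget stands in for the time bound,
and `toy_H` and `toy_C` are hand-written stand-ins for the learned heuristic and cost functions.
All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Output = Tuple[str, ...]
ScoreFn = Callable[[Sequence[str], Output], float]


def successors(y: Output, labels: Sequence[str]) -> List[Output]:
    """Complete outputs that differ from y in exactly one position."""
    return [y[:i] + (lab,) + y[i + 1:]
            for i in range(len(y)) for lab in labels if lab != y[i]]


def hc_search(x: Sequence[str], y0: Output, labels: Sequence[str],
              H: ScoreFn, C: ScoreFn, max_steps: int) -> Output:
    """Greedy search guided by H; return the visited output with least cost C."""
    current, visited = y0, [y0]
    for _ in range(max_steps):
        candidates = successors(current, labels)
        if not candidates:
            break
        best = min(candidates, key=lambda y: H(x, y))
        if H(x, best) >= H(x, current):  # no successor looks better: stop
            break
        current = best
        visited.append(current)
    return min(visited, key=lambda y: C(x, y))


# Toy stand-ins (assumptions): both functions count disagreements with a tiny lexicon.
LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB"}


def toy_H(x: Sequence[str], y: Output) -> float:
    return float(sum(LEXICON[w] != t for w, t in zip(x, y)))


def toy_C(x: Sequence[str], y: Output) -> float:
    return float(sum(LEXICON[w] != t for w, t in zip(x, y)))


words = ("the", "dog", "barks")
prediction = hc_search(words, ("NOUN", "NOUN", "NOUN"),
                       ("DET", "NOUN", "VERB"), toy_H, toy_C, max_steps=10)
```

In the sketch, H only decides which outputs get visited, while C alone decides which visited output
is returned, which is the division of labor the framework relies on.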

Our learning approach is motivated by our observation that, for a variety of structured prediction
problems, if we use the true loss function of the problem to guide the search, high-quality
outputs are found very quickly. This suggests that similar performance might be achieved if we
could learn an appropriate cost function to guide the search in place of the true loss function
(which is not available at the time predictions are computed). An advantage of our search-based
approach, compared to most structured-prediction approaches such as conditional random fields
(CRFs), is that it scales gracefully with the complexity of the cost function’s dependency
structure. In addition to the cost function used to evaluate the final solutions, the search is
guided by a heuristic function H to explore more promising states (Doppa et al. 2013, Doppa et al.
2014a, Doppa et al. 2014b).

The goal of learning the heuristic and cost functions is to rank outputs as if the true loss
function were being used to rank the intermediate and final outputs. We formulate and solve this
problem in the framework of imitation learning by treating the search algorithm as an expert whose
decisions are imitated to produce the target output. For example, the heuristic function learns to
rank states that lead to the correct target output ahead of those that lead to incorrect outputs
during search. The cost function learns to rank the correct target output ahead of incorrect
outputs.

We obtained results competitive with state-of-the-art systems based on Conditional Random Fields
(CRFs) for part-of-speech tagging (96.93% vs. 96.84%) and for chunking (94.66% vs. 94.77%) on
benchmark datasets.

One of the key insights that came out of this work is that limited discrepancy search, which
explores a space of possible outputs by starting with a greedy initialization, introducing a
limited number of discrepancies, and propagating them through local inference, is very effective
in combining search and knowledge to quickly find good outputs. Another surprising lesson is that
although both our cost function and our heuristic function are based on the same set of features,
and both operate on complete outputs, the distributions of ranking problems that they encounter
are different enough that it works better to learn two different functions than to share one
function for both guiding the search and selecting the final output.

4.2 Co-reference Resolution via Prune-and-Score

Co-reference  resolution  can  be  viewed  as  clustering  sets  of  mentions  such  that  the  mentions  in  the 
same  cluster  refer  to  the  same  entity  (Ng,  2010).  In  our  search-based  formulation,  the  mentions 
are  processed  incrementally  from  left  to  right.  Each  search  state  corresponds  to  the  set  of  clusters 
created  by  the  prefix  of  mentions  already  processed.  Each  action  adds  the  next  mention  to  an 
existing  cluster  or  starts  a  new  cluster  with  that  mention.  We  employ  a  greedy  search  which  adds 
the  next  mention  to  the  cluster  that  yields  the  highest  additional  score. 

In the Prune-and-Score approach to greedy co-reference resolution, we learn two heuristic
functions, one for pruning bad merge actions and the other for selecting the best among the
remaining merge actions (Ma et al. 2014). Both heuristics are learned by imitating the decisions
of the loss function. The merge actions that have the highest loss according to the training data
are the candidates for pruning, and all merge decisions that contribute zero loss are considered
good for selection. Learning adjusts the weights of the heuristic and the pruning functions so
that the decisions made by the learned functions are consistent with the training data.
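
The greedy left-to-right procedure with pruning can be sketched as below. This is an illustrative
sketch only: the function names, the `keep` parameter, and the string-overlap `toy_score` are
assumptions standing in for the learned pruning and scoring functions, which in the actual system
are weighted feature functions.

```python
from typing import Callable, List, Optional

Clusters = List[List[str]]
Action = Optional[int]  # index of the cluster to merge into, or None to start a new cluster
ActionScore = Callable[[Clusters, str, Action], float]


def prune_and_score(mentions: List[str], prune_score: ActionScore,
                    select_score: ActionScore, keep: int) -> Clusters:
    """Process mentions left to right; prune actions, then pick the best survivor."""
    clusters: Clusters = []
    for mention in mentions:
        actions: List[Action] = list(range(len(clusters))) + [None]
        # Pruning step: keep only the `keep` actions ranked highest by the pruner.
        ranked = sorted(actions, key=lambda a: prune_score(clusters, mention, a),
                        reverse=True)
        survivors = ranked[:keep]
        # Scoring step: choose the best remaining action.
        best = max(survivors, key=lambda a: select_score(clusters, mention, a))
        if best is None:
            clusters.append([mention])
        else:
            clusters[best].append(mention)
    return clusters


def toy_score(clusters: Clusters, mention: str, action: Action) -> float:
    """Assumed stand-in: favor merging when the mention shares a word with the cluster."""
    if action is None:
        return 0.5
    words = set(mention.lower().split())
    overlap = any(words & set(m.lower().split()) for m in clusters[action])
    return 1.0 if overlap else 0.0


result = prune_and_score(["Obama", "the president", "Barack Obama", "Paris"],
                         toy_score, toy_score, keep=2)
```

Here the same toy function plays both roles for brevity; the point of the approach is that the two
functions are learned separately.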

The Prune-and-Score approach gave results competitive with the state of the art on co-reference
resolution with gold mentions on multiple datasets. Table 1 shows the CoNLL AVG-F1 scores, the
standard metric for this competition, for our system compared to our system without pruning
(Score-only) and to the prior state of the art. Our scores are competitive with the state of the
art on ACE 2014 (Culotta test set) and Ontonotes and improve upon the state of the art on ACE 2014
(Newswire) and MUC6 by about 2 and 5 percentage points, respectively. Notably, pruning improves
upon the score-only approach in all tests, by 0.9 to 3.3 percentage points. This shows that the
additional expressive power gained by learning two functions rather than one is worth the cost,
and it mirrors the lesson learned from HC-Search in other domains.

Table 1: Comparison of Prune-and-Score to prior state of the art on benchmark coref datasets.

Dataset                       Prune-and-Score   Score-only   Prior State-of-the-Art
Ontonotes                     80.26             78.24        80.16 (Durrett and Klein 2013)
ACE 2014 (Culotta test set)   80.35             78.24        79.91 (Chang et al. 2013)
ACE 2014 (Newswire)           81.23             80.31        79.16 (Lee et al. 2013)
MUC6                          78.56             75.26        73.16 (Lee et al. 2013)

4.3  Easy-First  Cross-document  Co-reference  Resolution 

In this work, we address cross-document co-reference of events (verbs) in addition to entities
(nouns). The left-to-right processing of mentions is sometimes too restrictive and is inapplicable
when co-reference resolution is required across multiple documents. In the “easy first” approach
to co-reference resolution, we make high-confidence decisions first, which then make other
decisions easier via propagation of constraints (Stoyanov and Eisner 2012). Each search state
corresponds to a clustering of all mentions, where the initial state corresponds to the most
refined clustering, with each mention in its own cluster. The actions correspond to merging pairs
of clusters based on a heuristic evaluation function until a Halt decision is made. We follow the
greedy heuristic, where the cluster pair with the highest score (or the Halt action) is chosen at
each search step.
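
A minimal sketch of this greedy easy-first search follows. The names are hypothetical, and the
token-overlap `jaccard` score and constant Halt score are assumptions standing in for the learned
heuristic evaluation function.

```python
from typing import Callable, List, Optional, Set, Tuple

Cluster = List[str]


def easy_first_clustering(mentions: List[str],
                          pair_score: Callable[[Cluster, Cluster], float],
                          halt_score: Callable[[List[Cluster]], float]) -> List[Cluster]:
    """Start from singleton clusters; repeatedly merge the best pair until Halt wins."""
    clusters: List[Cluster] = [[m] for m in mentions]
    while len(clusters) > 1:
        best_pair: Optional[Tuple[int, int]] = None
        best = halt_score(clusters)  # a pair must beat the Halt action to be merged
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = pair_score(clusters[i], clusters[j])
                if s > best:
                    best_pair, best = (i, j), s
        if best_pair is None:  # Halt has the highest score
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def tokens(cluster: Cluster) -> Set[str]:
    return {w for m in cluster for w in m.lower().split()}


def jaccard(a: Cluster, b: Cluster) -> float:
    """Assumed toy score: word overlap between the two clusters."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


merged = easy_first_clustering(
    ["earthquake", "the earthquake", "quake victims", "victims"],
    jaccard, lambda clusters: 0.3)
```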

Our contribution to easy-first co-reference resolution is a principled approach to learning the
weights of the greedy heuristic function. Our learning algorithm adjusts the weights of a linear
classifier using an online passive-aggressive update. The search algorithm makes clustering
decisions greedily in the order suggested by a ranking function. A clustering decision is “bad” if
it is not consistent with the training data and “good” otherwise. A previous online approach to
easy-first co-reference updates the weights to encourage ranking the best (highest-scoring) good
decision ahead of the best (highest-scoring) bad decision (Goldberg and Elhadad 2010). We call
this approach best good vs. best bad (BGBB). One problem with this update is that it ignores the
other bad decisions that are still ranked above the good decision, requiring many more future
updates. Our best good vs. violated bad (BGVB) update takes a more principled approach by
encouraging the best good decision to lead all bad decisions that currently rank higher, so that
the good decision is preferred. The update rule is derived by formulating an appropriate
convex-concave optimization problem and solving it using the Majorization-Minimization scheme
(Hunter and Lange, 2004).
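
The difference between the two updates lies in which decisions enter the update. The sketch below
only selects those decisions; it does not reproduce the update rule itself, which is not spelled
out here. The function names and the example scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def best_good(scores: Sequence[float], is_good: Sequence[bool]) -> int:
    """Index of the highest-scoring good decision."""
    good = [i for i, g in enumerate(is_good) if g]
    return max(good, key=lambda i: scores[i])


def bgbb_pair(scores: Sequence[float], is_good: Sequence[bool]) -> Tuple[int, int]:
    """BGBB: the best good decision and the single best bad decision."""
    bad = [i for i, g in enumerate(is_good) if not g]
    return best_good(scores, is_good), max(bad, key=lambda i: scores[i])


def bgvb_sets(scores: Sequence[float], is_good: Sequence[bool]) -> Tuple[int, List[int]]:
    """BGVB: the best good decision and every bad decision scored at least as high."""
    g = best_good(scores, is_good)
    violated = [i for i, ok in enumerate(is_good)
                if not ok and scores[i] >= scores[g]]
    return g, violated


# Five candidate merge decisions with current scores and gold good/bad labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
is_good = [False, False, True, True, False]
bgbb = bgbb_pair(scores, is_good)   # (2, 0): only decision 0 is pushed down
bgvb = bgvb_sets(scores, is_good)   # (2, [0, 1]): both violating decisions are pushed down
```

In this example, a BGBB update leaves decision 1 ranked above the best good decision, so another
update would be needed later, whereas BGVB addresses both violations at once.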

We evaluated our method on cross-document co-reference for both events and entities. As shown in
Table 2, our results are significantly better than BGBB and slightly better than the prior results
of Lee et al. (2012) on predicted mentions of their benchmark dataset.
Table 2: Comparison of cross-document coreference results on predicted mentions.

Dataset (EECB Corpus)   BGVB    BGBB    Lee et al. (2012)
Entities only           54.40   50.31   54.21
Events only             47.88   40.70   46.50
Entities and Events     55.80   49.83   55.74

4.4  Event  Detection  via  Forward-Backward  Recurrent  Neural  Networks 


Most of the work described so far has relied on labor-intensive feature engineering. Recent work
in language processing has employed deep neural networks for a variety of tasks, from low-level
tasks such as parsing (Chen and Manning, 2014) to more semantic tasks such as question answering
(Zhang et al. 2017). Neural networks avoid feature engineering by embedding words in a semantic
vector space based on the contexts of their use (Pennington et al., 2014). Words used in similar
contexts have similar embeddings.


Our group has pioneered the use of recurrent neural networks for detecting multi-word phrases
that indicate the presence of events of predefined types (Ghaeini et al. 2016). Our recurrent
neural network architecture, called Forward-Backward Recurrent Neural Network (FB-RNN), divides
the sentence into three parts, where the middle part covers the phrase that may denote the event,
and the left and right parts capture the corresponding contexts. Each word is replaced by its word
embedding learned from a corpus. The relative position of the word in the sentence is captured
separately as a “branch embedding” and concatenated with the word embedding. The embeddings of the
left and middle parts of the sentence are processed in the forward direction by a recurrent neural
network (a Gated Recurrent Unit, or GRU), while the right part is processed in the backward
direction. The outputs of the GRUs are concatenated and passed through a fully connected neural
network with a softmax output layer that classifies the event into one of the predefined types or
the “none” type.
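
The three-way split and the reading directions can be illustrated with a small sketch. It covers
only the input preparation, not the embeddings, GRUs, or classifier; the function name and the
choice of “broken into” as the candidate nugget are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def split_for_fb_rnn(tokens: Sequence[str], start: int,
                     end: int) -> Tuple[List[str], List[str], List[str]]:
    """Split a sentence around the candidate nugget tokens[start:end].

    The left context and the nugget are returned in reading order (forward
    direction); the right context is reversed so that it is read backward,
    from the end of the sentence toward the nugget.
    """
    left = list(tokens[:start])
    nugget = list(tokens[start:end])
    right_reversed = list(reversed(tokens[end:]))
    return left, nugget, right_reversed


sentence = "An unknown man had broken into a house last November".split()
left, nugget, right_reversed = split_for_fb_rnn(sentence, 4, 6)
```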


Figure 1: A Forward-Backward RNN for event detection applied to the sentence “An unknown man
had broken into a house last November.”
FB-RNN was evaluated on ACE 2015 and Rich ERE 2015. It performed competitively on ACE 2015
compared to the CNN-based system of Nguyen and Grishman (2015) (F1 of 67.4% vs. 67.6%). Its
performance on Rich ERE 2015 was 0.8 percentage points below the top-ranking system (F1 of 57.61%
vs. 58.41%) and was higher than all other submissions to the TAC-KBP competition. Compared to the
previous CNN-based approach, FB-RNN also has the advantage of being able to detect multi-token
events (event nuggets).

4.5  Learning  Scripts  via  Hidden  Markov  Models 

It has long been noted that natural language understanding is a knowledge-intensive task (Wilks
and Charniak, 1976). People’s understanding of narrative texts is vastly enhanced by their
knowledge of stereotypical scripts such as restaurants and birthday parties (Schank and Abelson,
1977). Scripts capture a stereotypical sequence of events that typically occur in a given context
while allowing for variations. There has been a resurgence of interest in learning scripts from
naturally occurring texts (Chambers 2013). One of our main contributions was to formally connect
scripts to the formalism of Hidden Markov Models (HMMs) (Rabiner 1990) and derive algorithms for
learning them from simple natural language texts that describe various scenarios. In our
framework, the states of the HMM correspond to the events in the text, and the state transitions
correspond to event transitions (Orr et al. 2014).

One key feature missing from the standard algorithms for HMMs is the ability to account for
missing observations, which are quite common in text. We adapt the learning and inference
algorithms for HMMs to text by allowing any event to be missing with some probability. This
requires the algorithms to maintain two indices at every point in the text, one that corresponds
to the place of the event in the narration and the other that corresponds to the place of the
event in the complete script that includes all observations. The resulting learning and inference
algorithms are general and applicable to other contexts such as bioinformatics, where missing
observations are also common (Krogh et al. 1994).
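
The role of the two indices can be seen in a simplified forward computation. The sketch below
assumes a linear-chain script (each state is followed by exactly one next state) and a single skip
probability shared by all states; the model described above is a general HMM, so this is an
illustration of the two-index recursion rather than the project’s algorithm. All names are
hypothetical.

```python
from typing import Dict, List, Sequence


def script_likelihood(script: Sequence[str], observed: Sequence[str],
                      emit: Dict[str, Dict[str, float]], p_skip: float) -> float:
    """Probability that a linear-chain script generates the observed narration.

    alpha[i][j] is the probability that the first i script states have been
    visited and have produced the first j observed events. Each state is either
    skipped (probability p_skip) or emits one event (probability 1 - p_skip).
    """
    n, m = len(script), len(observed)
    alpha: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(1, n + 1):          # index into the complete script
        state = script[i - 1]
        for j in range(m + 1):         # index into the narration
            total = alpha[i - 1][j] * p_skip  # state i is missing from the text
            if j > 0:
                e = emit[state].get(observed[j - 1], 0.0)
                total += alpha[i - 1][j - 1] * (1.0 - p_skip) * e
            alpha[i][j] = total
    return alpha[n][m]


script = ["enter", "order", "eat", "pay"]
emit = {s: {s: 1.0} for s in script}  # toy emissions: each state emits its own event
likelihood = script_likelihood(script, ["enter", "eat", "pay"], emit, p_skip=0.2)
```

With these toy numbers the only consistent alignment skips “order”, giving 0.8 × 0.2 × 0.8 × 0.8 =
0.1024, while a narration that violates the script order gets probability zero.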

Another innovation of ours is to learn the structure of the HMM through bottom-up merging of
event sequences extracted from individual texts. The merging is guided by a structure search
procedure that merges states and removes edges, scoring the resulting structures by a combination
of data likelihood and model simplicity. Each step in structure search is followed by parameter
estimation, which is heuristically optimized to minimize the number of repeated calculations. For
more efficient processing, we divided the documents into mini-batches, merged them separately,
and merged the results with the full script.

We evaluated the script learning algorithm on the OMICS corpus of simple narrative texts about
multiple domains collected by the Honda Research Institute (Gupta and Kochenderfer 2004). We
selected 74 domains, each of which has at least 50 narratives and events of at least 3 types. Our
algorithm significantly outperformed the baselines that did not take missing observations into
account (46.0% accuracy vs. 42.1%). Thanks to our scoring function, which penalizes complexity,
the scripts learned by our algorithm are simpler and more intuitive than those of the baselines.
Our work has renewed the interest of the NLP community in script learning. Some recent papers
include (Chaturvedi et al. 2017; Iyyer et al. 2016; Chaturvedi et al. 2016; Ferraro and Van Durme
2016; Pichotta and Mooney 2016).


4.6  Multitask  Structured  Prediction 

In this ongoing work, we are exploring several search-based approaches to the problem of
multi-task structured prediction (MTSP) in the context of multiple entity analysis tasks in
natural language processing, including named entity recognition, coreference resolution, and
entity linking.

We have studied three different search architectures for multi-task structured prediction that
make different tradeoffs between speed and accuracy (Ma et al. 2017). The fastest approach to
multi-task structured prediction is the independent architecture, where each task is solved
independently of the others. While it has the advantages of simplicity and a reduced search space,
the independent architecture does not benefit from the mutual constraints that arise between
different tasks.

The second natural candidate is the joint architecture, where we treat the MTSP problem as a
single task and search the joint space of multi-task structured outputs. Although it offers an
elegant unified framework, the joint architecture poses a major challenge: the branching factor of
the joint search space increases in proportion to the number of tasks, making the search too
expensive. Even single tasks such as co-reference resolution involve large branching factors. We
address this problem by learning pruning functions as in our Prune-and-Score approach.

Finally, we studied a third search architecture, referred to as cyclic, which is intermediate in
complexity between the above two architectures. The different tasks are performed in sequence and
repeated in a cycle as long as the current scoring function shows improvement. The cyclic
architecture has the advantage of not increasing the branching factor of the search beyond that of
a single task, while still taking advantage of mutual constraints between different tasks.
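
The control flow of the cyclic architecture can be sketched as follows. This is a schematic
illustration under stated assumptions: the per-task solvers, the joint scoring function, and the
integer “outputs” are toy stand-ins, and all names are hypothetical.

```python
from typing import Callable, Dict, List

Outputs = Dict[str, int]
Solver = Callable[[str, Outputs], int]


def cyclic_inference(x: str, tasks: List[str], solvers: Dict[str, Solver],
                     score: Callable[[str, Outputs], float],
                     init: Outputs) -> Outputs:
    """Re-solve one task at a time, given the others, while the score improves."""
    outputs = dict(init)
    best = score(x, outputs)
    improved = True
    while improved:
        improved = False
        for task in tasks:
            candidate = dict(outputs)
            # Each solver searches only its own task's space, conditioned on
            # the current outputs of the other tasks.
            candidate[task] = solvers[task](x, outputs)
            s = score(x, candidate)
            if s > best:
                outputs, best, improved = candidate, s, True
    return outputs


# Toy dependencies (assumptions): linking benefits from NER, coreference from linking.
toy_solvers: Dict[str, Solver] = {
    "ner": lambda x, out: 1,
    "link": lambda x, out: out["ner"],
    "coref": lambda x, out: out["link"],
}
final = cyclic_inference("document", ["coref", "link", "ner"], toy_solvers,
                         lambda x, out: float(sum(out.values())),
                         {"ner": 0, "link": 0, "coref": 0})
```

In this toy setting a single pass improves only NER; the later cycles let the improvement
propagate to linking and then to coreference, which is the mutual-constraint effect the cyclic
architecture is meant to exploit.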

We evaluated search-based multi-task structured prediction for entity analysis by jointly solving
the named entity recognition, co-reference, and entity linking tasks on two benchmark datasets,
ACE 2005 and TAC-KBP 2015, under these three architectures. The results are summarized in Table 3,
where the best results in each column are shown in bold. For the NER and LINK tasks, we show
accuracy percentages, and for Coref we report the CoNLL score. The joint architecture not only
outperforms the independent architecture, but also improves over the prior state-of-the-art
approach based on belief propagation in graphical models (Durrett and Klein 2014). The cyclic
architecture offers competitive performance at a reduced computational cost compared to the joint
architecture with pruning. The last column for each dataset shows the training time in minutes and
seconds. The joint architecture with pruning is the most expensive, while the cyclic architecture
takes only modestly more time than the independent architecture.
Table 3: Comparison of the Independent, Joint and Cyclic architectures for NER, linking and
coreference resolution on ACE 2005 and TAC-KBP 2015.

Datasets          ACE 2005                        TAC-KBP 2015
Tasks             NER     Link    Coref   Time    NER     Link    Coref   Time
Berkeley          85.60   76.78   76.35   31m     88.90   74.80   82.98   6m29s
Independent       82.24   75.36   75.04   9m      87.30   76.20   81.21   2m41s
Joint w/ pruning  87.18   80.28   77.85   37m     89.33   77.68   83.17   9m2s
Cyclic            84.18   80.67   77.29   11m     89.57   77.68   82.08   3m52s

4.7  Cross-lingual  Entity  Linking 

We  also  participated  in  the  TAC-KBP  competitions  every  year  starting  from  2013  in  the  entity 
linking  and  event-argument  extraction  tasks,  culminating  in  our  final  system  for  the  Trilingual 
Entity  Discovery  and  Linking  (TEDL)  task  in  2016. 

The TEDL task consists of assigning the corresponding entities in the knowledge base (KB) to
the query mentions in each document and clustering the mentions into coreferring sets when there
is no corresponding entity in the KB. This task is quite challenging because the coreference
clusters span multiple documents in possibly different languages.

Our system is based on a cross-lingual entity linking model in which we use deep learning
techniques to make the performance less sensitive to language specifics. Our proposed cross-lingual
entity linker consists of a mention model and a context model. The mention model captures the lexical
compatibility between the mentions and the entities in the English language. Following Durrett
and Klein (2014), we also define a latent query variable for each mention that represents the most
likely prefix that generates the mention. The mention model is a log-linear model that computes a
lexical compatibility score between a mention and an entity, marginalized over the query variable.
The model uses transliteration to obtain the mention-entity features for non-English languages.
The context model leverages the contextual information encoded in mention and entity
embeddings to make the mention model less sensitive to English-specific features. For each mention and
K=6 of its closest mentions in the embedding space, we compute and sum the dot products
between their embeddings to get the context model score. The final score of a mention-entity pair
is the product of the scores of the mention model and the context model.
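The scoring scheme above can be illustrated with a minimal sketch. It is not our actual implementation: the feature dictionaries, the function names, the use of Euclidean distance to pick the closest mentions, and the choice to take dot products against the candidate entity's embedding are all assumptions made for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mention_model_score(weights, features_by_query):
    """Unnormalized log-linear score, summed (marginalized) over the latent
    query variable. `features_by_query` holds one hypothetical feature dict
    f(mention, query, entity) per candidate query."""
    total = 0.0
    for feats in features_by_query:
        activation = sum(weights.get(name, 0.0) * value
                         for name, value in feats.items())
        total += math.exp(activation)
    return total

def context_model_score(mention_vec, other_mention_vecs, entity_vec, k=6):
    """Sum of dot products with the entity embedding, taken over the mention
    and its k closest mentions (closeness by Euclidean distance is assumed)."""
    neighbours = sorted(other_mention_vecs,
                        key=lambda v: math.dist(mention_vec, v))[:k]
    return sum(dot(v, entity_vec) for v in [mention_vec] + neighbours)

def final_score(mention_score, context_score):
    """Final mention-entity score: product of the two model scores."""
    return mention_score * context_score
```

In this sketch a candidate entity would be ranked by `final_score` for each mention; normalizing the mention model score over candidate entities is omitted for brevity.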

We cluster the mentions that do not have a corresponding entity in the KB into coreferring sets
using within-document and cross-document coreference techniques. For within-document
coreference, we use the Prune-and-Score system described in Section 3.2. The cross-document
coreference is done by a rule-based agglomerative clustering algorithm similar to Stanford's
multi-sieve system (Lee et al., 2011). However, unlike Stanford's system, which applies rules
sequentially, our system computes a score for each cluster pair based on rules that judge the
compatibility of each pair of mentions. The score of the cluster pair is the fraction of compatible pairs
of mentions in the two clusters. We sequentially merge the pairs of clusters whose score exceeds
a preset threshold. More details can be found in (Shahbazi et al. 2016).
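The cluster-pair scoring and merging step can be sketched as follows. This is an illustrative sketch only: the compatibility rule `same_last_token`, the threshold value, and the merge order (first qualifying pair found) are assumptions and do not reproduce the rules of the actual system.

```python
from itertools import combinations

def cluster_pair_score(c1, c2, compatible):
    """Fraction of cross-cluster mention pairs judged compatible."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(1 for a, b in pairs if compatible(a, b)) / len(pairs)

def agglomerate(mentions, compatible, threshold=0.5):
    """Start from singleton clusters and repeatedly merge a pair of clusters
    whose score exceeds the threshold, until no such pair remains."""
    clusters = [[m] for m in mentions]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if cluster_pair_score(clusters[i], clusters[j], compatible) > threshold:
                clusters[i] = clusters[i] + clusters[j]
                del clusters[j]
                merged = True
                break  # cluster indices changed; rescan from the start
    return clusters

def same_last_token(a, b):
    """Hypothetical compatibility rule: same final token, ignoring case."""
    return a.lower().split()[-1] == b.lower().split()[-1]

if __name__ == "__main__":
    mentions = ["Barack Obama", "Obama", "Angela Merkel", "Merkel"]
    print(agglomerate(mentions, same_last_token))
```

Scoring whole cluster pairs, rather than applying rules one sieve at a time, lets a single incompatible mention pair lower the score without necessarily blocking a merge.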
Our system was ranked 6th among 12 systems in the first window of evaluation of the TEDL task in
KBP-2016 according to the mention CEAF measure (Ji et al. 2016). This measure finds the
optimal alignment between system and gold standard clusters, and then evaluates the precision and
recall, micro-averaged over mentions. We ranked 8th in the second window of evaluation,
although the performance of our system improved beyond the first window. However, as the
systems were allowed to use the other systems' outputs in the second window, the relative
rankings are less meaningful.
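To make the measure concrete, the following is a minimal sketch of a mention-based CEAF computation. It assumes that each mention is a hashable identifier and that the similarity between two clusters is the number of mentions they share. The brute-force search over alignments is for illustration on tiny inputs only; practical scorers use the Kuhn-Munkres algorithm, and the official scorer may differ in details such as how mention spans and types are matched.

```python
from itertools import permutations

def mention_ceaf(system, gold):
    """Return (precision, recall, F1) under the best one-to-one alignment
    of system clusters to gold clusters, scored by shared mentions."""
    system = [set(c) for c in system]
    gold = [set(c) for c in gold]
    # Pad the shorter side with empty clusters so that every permutation
    # describes a complete one-to-one alignment.
    n = max(len(system), len(gold))
    sys_padded = system + [set()] * (n - len(system))
    gold_padded = gold + [set()] * (n - len(gold))
    best_overlap = max(
        sum(len(s & gold_padded[j]) for s, j in zip(sys_padded, perm))
        for perm in permutations(range(n))
    )
    n_sys = sum(len(c) for c in system)
    n_gold = sum(len(c) for c in gold)
    precision = best_overlap / n_sys if n_sys else 0.0
    recall = best_overlap / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    system = [{"a", "b"}, {"c"}, {"d"}]
    gold = [{"a", "b", "c"}, {"d", "e"}]
    print(mention_ceaf(system, gold))
```

Because the counts are pooled over all mentions before dividing, large clusters contribute more to the score than small ones, which is what micro-averaging over mentions means here.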


5.  CONCLUSIONS  AND  FUTURE  WORK 

In summary, our research shows that search-based structured prediction has good potential in
multiple subtasks of language understanding and is competitive with other methods based on
graphical models and optimization. Our latest work on multi-task structured prediction shows
that the search-based approach makes it easy to combine multiple subtasks into a unified
framework and yields superior performance at only a modest cost. We have also begun to explore
neural network-based models that avoid extensive feature engineering and yield highly competitive
results. We point out the following opportunities for future research, some of which we have
already begun.

1.  Combine  the  neural  network  models  with  search-based  structured  prediction  to  jointly 
solve  multiple  tasks  to  enable  superior  performance  without  feature  engineering. 

2.  Explore  other  architectures  for  multi-task  structured  prediction  that  improve  accuracies 
further  with  little  loss  in  computational  efficiency.  The  cyclic  architecture  we  developed 
is  very  promising  in  this  regard  and  could  lead  to  greater  gains  with  further  optimizations, 
e.g.,  change  propagation. 

3.  Integrate  the  entity  discovery  and  linking  task  with  the  event-argument  extraction  task  to 
build  a  more  comprehensive  language  understanding  system. 

4. Investigate ways to combine inference and learning in multiple modalities such as
language and vision.

5. Systematically integrate our system into a knowledge-based system framework by
combining the inferences from different subsystems in a principled manner while taking into
account the confidences of their predictions. Building integrated AI
systems in a principled manner is a woefully under-studied problem with a few notable
exceptions (Dietterich and Bao, 2008).